Although some models have used more than a single risk factor, most research relies on traditional statistical approaches that restrict the number of variables that can be simultaneously examined, creating overly simplistic models (Franklin et al. 2017)
Theoretically, the processes that facilitate suicide morbidity are complex and entail multiple interactions; therefore, any risk factor considered in isolation will be an inaccurate predictor
A shift in research is needed to capture the complexities behind adolescent suicide morbidity
Using methods with better predictive performance
Risk algorithms instead of single risk factors
Machine learning in Suicidology
35 independent studies used ML to predict suicide-related events
More accurate predictions than traditional statistical methods (AUCs = 0.80–0.84)
There is a scarcity of research using adolescent populations
Research aims
Identify the critical risk factors for adolescent suicide morbidity from a set of 99 risk behavior predictors with machine learning classification algorithms.
Identify the best machine learning methodology to classify adolescents who attempted and considered suicide according to its classification performance (ROC, overall accuracy, and the Kappa value).
Compare the performance of an a priori-determined model to models informed by feature selection from the least absolute shrinkage and selection operator method.
Identify if there are differences in the critical risk factors for suicide ideation and suicide attempts.
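The performance measures named in the aims above (ROC, overall accuracy, and Kappa) can be illustrated with a minimal sketch. The labels, scores, and the 0.5 cutoff below are made-up toy values, not YRBS results, and the function names are hypothetical helpers rather than anything from the study's code.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of cases classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(y_true, y_pred):
    """Agreement between predictions and truth, corrected for chance agreement."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_chance = p_yes + p_no
    return (p_observed - p_chance) / (1 - p_chance)


def roc_auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 positive cases out of 10 (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4, 0.2, 0.1, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

With an outcome this imbalanced, Kappa sits well below raw accuracy because much of the agreement would be expected by chance alone.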
Conceives of human development as the constant interaction between the individual and the changing environment in which they live and grow (Bronfenbrenner 1977).
Ontogenic
Sex
Race
Age
Microsystem
Family members
Friends
School
Exosystem
The media
Neighborhood
Macrosystem
Economic, social, educational, legal, and political systems
Allows adolescent suicide morbidity to be studied as the interaction of multiple risk factors at multiple levels of the adolescent system (Perkins and Hartless 2002).
Moves beyond the tendency to evaluate only individualistic characteristics of adolescents.
Surveys that monitor health behaviors and experiences among high school students in grades 9–12 attending U.S. public and private schools since 1991 (Underwood et al. 2020)
The main categories included in the surveys:
Behaviors that contribute to unintentional injury and violence
Tobacco use
Alcohol and other drug use
Sexual behaviors that contribute to unintended pregnancy and STD/HIV infection
The total weighted sample for the Combined YRBS High School Dataset is 14,395,146 cases.
From these, 7,159,104 are female, and 7,141,727 are male.
The proportion of students that reported attempting suicide in this data is 8%
The proportion of students that considered suicide is 15%.
Outcomes:
(Q26) During the past 12 months, did you ever seriously consider attempting suicide?
(Q28) During the past 12 months, how many times did you actually attempt suicide?
Predictors:
Demographic variables (age, sex, grade, race, sexual identity, site, year)
Questionnaire items (q8-q99)
Logistic Regression, Lasso, K-Nearest Neighbors, Random Forest, Classification and Regression Trees, and Extreme Gradient Boosting will be used to generate the predictive models
Model the outcome as a linear function of the predictors (Burkov 2019).
The sigmoid function is applied to adjust the predictions to stay between 0 and 1 (Burkov 2019)
The predictors will be selected from past literature modeling YRBSS data (Bae et al. 2005)
Logistic regression GIF. Source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
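The logistic regression description above (a linear function of the predictors passed through the sigmoid) can be sketched in a few lines. The weights, bias, and input values are hypothetical numbers chosen for illustration; they are not fitted to YRBS data.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Linear combination of the predictors, then the sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


# Hypothetical coefficients for two predictors (illustration only).
weights = [0.8, -0.5]
bias = -1.0

p = predict_proba([2.0, 1.0], weights, bias)  # linear score z is about 0.1
predicted_class = 1 if p >= 0.5 else 0        # assumed 0.5 cutoff
```

The linear score can be any real number; the sigmoid is what keeps the output interpretable as a probability.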
Shrinks the coefficient estimates toward zero (James et al. 2013).
It is extremely useful for variable importance, selection, and regularization (James et al. 2013).
This technique can set some coefficients exactly to zero, so only the relevant predictors are retained (James et al. 2013).
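How shrinkage leads to variable selection can be shown with the soft-thresholding rule. This is a simplified sketch: it assumes a linear model with orthonormal predictors, the special case in which the lasso solution is the unpenalized estimate pulled toward zero by a fixed amount. Lasso for a binary outcome is fitted iteratively instead. The predictor names, coefficient values, and penalty `lam` are hypothetical.

```python
def soft_threshold(beta, lam):
    """Pull a coefficient toward zero by lam; set it to exactly zero if |beta| <= lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


# Hypothetical unpenalized coefficients (illustration only).
unpenalized = {
    "predictor_a": 1.5,
    "predictor_b": -0.2,
    "predictor_c": 0.05,
    "predictor_d": -0.9,
}
lam = 0.3  # assumed penalty strength

lasso_coefs = {name: soft_threshold(b, lam) for name, b in unpenalized.items()}

# Predictors whose coefficient survived the penalty.
selected = [name for name, b in lasso_coefs.items() if b != 0.0]
```

Large coefficients are shrunk but kept, while small ones are dropped entirely, which is why the method works as both regularization and feature selection.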
KNN will assign class membership to a data point with the majority vote of its K-nearest neighbors.
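The majority-vote rule can be sketched as follows. The training points, labels, and the choices of `k = 3` and Euclidean distance are assumptions made for illustration; the source does not state which distance or value of K is used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated clusters (illustration only).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
near_second_cluster = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
```

An odd `k` avoids tied votes when there are two classes.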
Divides the feature space into non-overlapping rectangular regions with similar response rates that can later be used for prediction (Greenwell 2022).
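A single partitioning step of this kind can be sketched as below: one predictor is split at the threshold that leaves the two resulting regions as pure as possible. Using Gini impurity as the purity measure is an assumption here (the source does not name a criterion), and the data are toy values. A full tree would repeat the step recursively within each region and across all predictors.

```python
def gini(labels):
    """Gini impurity for binary labels coded 0/1 (0 means a pure region)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best single split on one predictor."""
    n = len(values)
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Try the midpoint between each pair of adjacent distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy predictor where low values are class 0 and high values class 1.
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(values, labels)
```

Each region then predicts the response rate of the training cases that fall inside it.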
Random forest consists of hundreds or thousands of independently grown decision trees generated from different bootstrap samples from the training data (Greenwell 2022).
Combines hundreds of trees, which results in a more flexible decision boundary
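The bootstrap-and-vote idea behind random forest can be sketched with very small trees. This is a simplified illustration, not a full random forest: each "tree" is a single split (a stump) on one toy predictor, so the random selection of predictors at each split, which a real random forest also performs, is omitted. The data, `n_trees`, and `seed` are assumed values.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(values, labels):
    """Fit a one-split tree by choosing the threshold with the fewest training errors."""
    fallback = majority(labels)
    best = {"threshold": None, "left": fallback, "right": fallback,
            "errors": len(labels) + 1}
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if errors < best["errors"]:
            best = {"threshold": threshold, "left": left_label,
                    "right": right_label, "errors": errors}
    return best


def predict_stump(stump, value):
    """Predict with one stump; a stump without a split predicts its fallback label."""
    if stump["threshold"] is None:
        return stump["left"]
    return stump["left"] if value <= stump["threshold"] else stump["right"]


def fit_forest(values, labels, n_trees=25, seed=0):
    """Grow each stump independently on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(values)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n cases with replacement
        forest.append(fit_stump([values[i] for i in idx],
                                [labels[i] for i in idx]))
    return forest


def predict_forest(forest, value):
    """Aggregate the individual trees by majority vote."""
    return majority([predict_stump(stump, value) for stump in forest])


# Toy data: values up to 5 are class 0, larger values class 1.
values = list(range(1, 11))
labels = [0] * 5 + [1] * 5

forest = fit_forest(values, labels, n_trees=25, seed=0)
low_prediction = predict_forest(forest, 1)
high_prediction = predict_forest(forest, 10)
```

Because every tree sees a slightly different resample, the individual thresholds vary, and averaging their votes gives a smoother boundary than any single tree.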